Imagine we’re some fancy data scientists exploring - once again - the gapminder data. We’re particularly interested in the development of the GDP across time and across countries. Some R-fanatics from GESIS recommended using this tidyverse thing to complete our tasks. At the same time, they also hesitate to load all of its R-packages at once.
tidyverse for importing Excel data and for data wrangling.
library(readxl)
library(dplyr)
library(tidyr)
Ok, that wasn’t too hard. But data science is about data, so we have to load in the data.
sheet = "name_of_your_sheet"
gapminder_example <-
readxl::read_excel(
path = "../data/gapminder/GDPpercapitaconstant2000US.xlsx",
sheet = "Data"
)
Have the data been successfully imported? They should comprise a tibble of 275 x 53. Furthermore, the income per person for Algeria of the years 1960, 1961, and 1962 should be 1280, 1085, and 856.
select() and by filtering by number with slice().
gapminder_example %>%
select(1:4) %>%
slice(n = 5)
## # A tibble: 1 x 4
## `Income per person (fixed 2000 US$)` `1960.0` `1961` `1962`
## <chr> <dbl> <dbl> <dbl>
## 1 Algeria 1280. 1085. 856.
Let’s say we’re interested in the earliest 10 years of development in all countries and in the most recent 10 years. The idea is that there might be some differences between the early days and the new days of GDP development. At first, we’d like to compute such statistics across all countries. Unfortunately, the data are in the wide format.
gather(). Additionally, you might want to create a more convenient column name for the variable Income per person (fixed 2000 US$) with rename() as its really messy.
gapminder_example %>%
rename(country = `Income per person (fixed 2000 US$)`) %>%
gather(-country, key = "year", value = "GDP")
## # A tibble: 14,300 x 3
## country year GDP
## <chr> <chr> <dbl>
## 1 Abkhazia 1960.0 NA
## 2 Afghanistan 1960.0 NA
## 3 Akrotiri and Dhekelia 1960.0 NA
## 4 Albania 1960.0 NA
## 5 Algeria 1960.0 1280.
## 6 American Samoa 1960.0 NA
## 7 Andorra 1960.0 NA
## 8 Angola 1960.0 NA
## 9 Anguilla 1960.0 NA
## 10 Antigua and Barbuda 1960.0 NA
## # … with 14,290 more rows
Ok, did it work out? There are still a lot of missing values we might get rid of, and the data are not arranged properly. They make the data untidy, distract us and are not part of any mean calculations anyway. For the next upcoming tasks, simply re-use your code and add the following commands with the %>%.
filter() in combination with !is.na.
gapminder_example %>%
rename(country = `Income per person (fixed 2000 US$)`) %>%
gather(-country, key = "year", value = "GDP") %>%
filter(!is.na(GDP)) %>%
arrange(year, GDP)
## # A tibble: 7,988 x 3
## country year GDP
## <chr> <chr> <dbl>
## 1 Malawi 1960.0 98.6
## 2 China 1960.0 105.
## 3 Burundi 1960.0 116.
## 4 Burkina Faso 1960.0 122.
## 5 Lesotho 1960.0 132.
## 6 Nepal 1960.0 139.
## 7 Togo 1960.0 177.
## 8 India 1960.0 181.
## 9 Pakistan 1960.0 187.
## 10 Indonesia 1960.0 201.
## # … with 7,978 more rows
Nice. Now we got a - more or less - clean dataset for our actual task: calculating the mean values across all countries for each of the first ten years and each of the last ten years. What’s still a little bit distracting is that we got the values for all years between these two periods in the data. But we decided that we leave them there for some future analyses. As such, we do all analyses on the fly. Let’s start with the first period.
GDP across all countries for each of the first ten years.
double, you can simply filter the range of years you are interested in.
gapminder_example %>%
rename(country = `Income per person (fixed 2000 US$)`) %>%
gather(-country, key = "year", value = "GDP") %>%
filter(!is.na(GDP)) %>%
arrange(year, GDP) %>%
filter(year >= 1960 & year <= 1969) %>%
group_by(year) %>%
summarise(GDP_over_all_countries = mean(GDP))
## # A tibble: 10 x 2
## year GDP_over_all_countries
## <chr> <dbl>
## 1 1960.0 2863.
## 2 1961 2936.
## 3 1962 3034.
## 4 1963 3127.
## 5 1964 3297.
## 6 1965 3479.
## 7 1966 3562.
## 8 1967 3644.
## 9 1968 3827.
## 10 1969 4015.
After this was done, you might know how to do that for the 10 most recent years…
GDP across all countries for each of the last ten years.
gapminder_example %>%
rename(country = `Income per person (fixed 2000 US$)`) %>%
gather(-country, key = "year", value = "GDP") %>%
filter(!is.na(GDP)) %>%
arrange(year, GDP) %>%
filter(year >= 2002 & year <= 2011) %>%
group_by(year) %>%
summarise(GDP_over_all_countries = mean(GDP))
## # A tibble: 10 x 2
## year GDP_over_all_countries
## <chr> <dbl>
## 1 2002 7983.
## 2 2003 8113.
## 3 2004 8335.
## 4 2005 8545.
## 5 2006 8899.
## 6 2007 9219.
## 7 2008 8999.
## 8 2009 8463.
## 9 2010 7700.
## 10 2011 7603.